This lecture is based on Supervised Machine Learning for Text Analysis in R.
Similar to our previous work, we can have both regression and classification text-based problems. A regression model predicts a numeric/continuous output ‘such as predicting the year of a United States Supreme Court opinion from the text of that opinion.’ A classification model predicts a discrete class ‘such as predicting whether a GitHub issue is about documentation or not from the text of the issue.’
Natural language needs to be standardized and transformed to numeric representations for modeling. We will use the textrecipes package to do this.
What language is and how language works is key to creating modeling features from natural language. Words in English are made of prefixes, suffixes, and root words. Defining a word can be quite difficult with compound words (like real estate or dining room). Preprocessing natural language has three primary steps: tokenization, removal of stop words, and sometimes stemming.
Tokenization can broadly be thought of as taking an input (such as a string) and a token type (such as a word) and splitting the input into pieces/tokens. This process is generally much more complex than you might think (e.g. more than splitting on non-alphanumeric characters); the tokenizers package implements a fast, consistent tool set in R (spaCy is also available via the spacyr package).
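To see why real tokenizers are needed, here is a naive baseline sketch in base R (this is NOT what the tokenizers package does internally; it is the simple split-on-non-alphanumerics approach the paragraph above warns about):

```r
# Naive word tokenizer: lowercase, then split on runs of characters
# that are not letters, digits, or apostrophes. Real tokenizers handle
# punctuation, contractions, URLs, etc. far more carefully.
naive_tokenize <- function(x) {
  tokens <- strsplit(tolower(x), "[^a-z0-9']+")[[1]]
  tokens[tokens != ""]
}

naive_tokenize("Far down in the forest grew a pretty little fir-tree")
# "far" "down" "in" "the" "forest" "grew" "a" "pretty" "little" "fir" "tree"
```

Note that even this toy version must make decisions, e.g. the hyphen in "fir-tree" becomes a token boundary.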
Tokens have a variety of units including characters, words, sentences, lines, paragraphs, and n-grams.
library(tokenizers)
library(tidytext)
library(tidyverse)

sample_vector <- c("Far down in the forest",
                   "grew a pretty little fir-tree")
sample_tibble <- tibble(text = sample_vector)
tokenize_words(sample_vector)
## [[1]]
## [1] "far" "down" "in" "the" "forest"
##
## [[2]]
## [1] "grew" "a" "pretty" "little" "fir" "tree"
sample_tibble %>%
  unnest_tokens(word, text, token = "words")
sample_tibble %>%
  unnest_tokens(word, text, token = "words", strip_punct = FALSE)
pride <- tibble(line = janeaustenr::prideprejudice)
pride %>%
  unnest_tokens(word, line) %>%
  count(word) %>%
  arrange(desc(n))
An n-gram is a contiguous sequence of n items from a given sequence of text. For example:
token_ngram <- tokenize_ngrams(x = pride %>% pull(line),
                               lowercase = TRUE,
                               n = 3L,
                               n_min = 3L,
                               stopwords = character(),
                               ngram_delim = " ",
                               simplify = FALSE)
token_ngram[[100]]
## [1] "are my old" "my old friends"
## [3] "old friends i" "friends i have"
## [5] "i have heard" "have heard you"
## [7] "heard you mention" "you mention them"
## [9] "mention them with" "them with consideration"
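Conceptually, n-grams are just overlapping windows over the token sequence. A minimal base-R sketch (the real tokenize_ngrams() is more general, handling lowercasing, stop words, and ranges of n):

```r
# Build all contiguous n-grams from a vector of word tokens.
ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  vapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

ngrams(c("far", "down", "in", "the", "forest"), 3)
# "far down in" "down in the" "in the forest"
```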
Tokenization is harder for languages that do not separate words with spaces, such as Chinese. The jiebaR package segments Chinese text into words.
library(jiebaR)
## Loading required package: jiebaRD
words <- c("下面是不分行输出的结果",  # roughly: "Below is the result of output without line breaks"
           "下面是不输出的结果")      # roughly: "Below is the result of no output"
engine1 <- worker(bylines = TRUE)
segment(words, engine1)
## [[1]]
## [1] "下面" "是"   "不"   "分行" "输出" "的"   "结果"
##
## [[2]]
## [1] "下面" "是"   "不"   "输出" "的"   "结果"
Some words carry less information than others. For example, a, the, or of. These common words are called stop words and are generally removed entirely. Let’s use the stopwords package here to provide some lists.
pride_words <- pride %>%
  unnest_tokens(word, line)
pride_words %>%
  semi_join(get_stopwords(source = "snowball")) %>%
  distinct() # present stop words
## Joining, by = "word"
pride_words %>%
  anti_join(get_stopwords(source = "snowball")) %>%
  distinct() # unique non stop words
## Joining, by = "word"
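Under the hood, stop-word removal is just set subtraction on tokens. A minimal base-R sketch with a tiny hand-made stop list (real lists such as Snowball contain a few hundred words):

```r
tokens    <- c("far", "down", "in", "the", "forest")
stop_list <- c("a", "in", "of", "the")  # tiny illustrative list

# Keep only the tokens that are not in the stop list.
tokens[!tokens %in% stop_list]
# "far" "down" "forest"
```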
What if we aren’t interested in the difference between banana and bananas? The core meaning of a word is often the same (e.g. ‘banana’), and stemming reduces words to a common root.
library(SnowballC)

pride_words %>%
  anti_join(get_stopwords(source = "snowball")) %>%
  mutate(word_stem = wordStem(word))
## Joining, by = "word"
pride_words %>%
  anti_join(get_stopwords(source = "snowball")) %>%
  mutate(word_stem = wordStem(word)) %>%
  summarize(nword = n_distinct(word),
            nstem = n_distinct(word_stem))
## Joining, by = "word"
Stemming reduces the feature space of text data but may change the underlying meaning of some sentences. It may or may not improve models.
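For a flavor of what a stemmer does, here is a toy base-R suffix-stripper. This is illustrative only; the Porter algorithm behind wordStem() handles many suffix classes and exceptions:

```r
# Toy stemmer: strip a final "s" unless it follows another "s"
# (so "class" is left alone). Real stemmers go far beyond plurals.
toy_stem <- function(w) sub("(?<!s)s$", "", w, perl = TRUE)

toy_stem(c("banana", "bananas", "forests", "class"))
# "banana" "banana" "forest" "class"
```

Even this toy version shows the trade-off: "bananas" and "banana" now collapse to one feature, but a rule this crude would also merge words that differ in meaning.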
We also need a data structure for text data: a document-term matrix, where each row represents a document, each column a term, and each cell a count (or weight).
complaints <- read_csv("https://github.com/EmilHvitfeldt/smltar/raw/master/data/complaints.csv.gz")
## Rows: 117214 Columns: 18
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (15): product, sub_product, issue, sub_issue, consumer_complaint_narrat...
## dbl (1): complaint_id
## date (2): date_received, date_sent_to_company
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
complaints %>%
  slice_sample(n = 10000) %>%
  unnest_tokens(word, consumer_complaint_narrative) %>%
  anti_join(get_stopwords(), by = "word") %>%
  mutate(stem = wordStem(word)) %>%
  count(complaint_id, stem) %>%
  cast_dfm(complaint_id, stem, n)
## Document-feature matrix of: 10,000 documents, 13,649 features (99.59% sparse) and 0 docvars.
## features
## docs 130.00 15 account alreadi amount attempt attornei avail care collect
## 3113809 1 1 2 1 1 1 2 1 1 1
## 3113817 0 0 0 0 0 0 0 0 0 0
## 3113929 0 0 7 0 0 1 0 0 0 0
## 3113930 0 0 1 0 0 0 0 0 0 0
## 3113969 0 0 0 0 6 0 0 0 0 0
## 3113974 0 0 0 0 1 0 0 0 0 1
## [ reached max_ndoc ... 9,994 more documents, reached max_nfeat ... 13,639 more features ]
This is a sparse matrix (where most elements are zero). This is because most documents do not contain most words.
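To make "sparse" concrete, here is a small base-R sketch computing the share of zero cells in a toy document-term matrix:

```r
# Toy 3-document x 4-term count matrix.
m <- matrix(c(1, 0, 0, 0,
              0, 2, 0, 0,
              0, 0, 0, 3),
            nrow = 3, byrow = TRUE)

# Sparsity: the proportion of cells that are zero.
sparsity <- mean(m == 0)
sparsity # 9 of the 12 cells are zero, so 0.75
```

The complaints matrix above is 99.59% sparse, which is why specialized sparse representations (like quanteda's dfm) matter for memory.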
We could also represent text data with weighted counts. The term frequency of a word is how frequently a word occurs in a document, and the inverse document frequency of a word decreases the weight for commonly-used words and increases the weight for words that are not used often in a collection of documents.
\[idf(\text{term}) = \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)}\] These two quantities can be combined to calculate a term’s tf-idf (the two quantities multiplied together). This statistic measures the frequency of a term adjusted for how rarely it is used, and it is an example of a weighting scheme that can often work better than counts for predictive modeling with text features.
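As a worked example, here is a minimal base-R sketch of tf and idf for a made-up three-document corpus:

```r
# Toy corpus: three tokenized documents.
docs <- list(c("the", "cat", "sat"),
             c("the", "dog", "sat"),
             c("the", "cat", "ran"))
n_docs <- length(docs)

# Term frequency of "cat" in document 1: count / document length.
tf <- sum(docs[[1]] == "cat") / length(docs[[1]])

# Inverse document frequency: ln(n documents / n documents containing term).
df  <- sum(vapply(docs, function(d) "cat" %in% d, logical(1)))
idf <- log(n_docs / df)

tf * idf        # tf-idf of "cat" in document 1: (1/3) * ln(3/2)
log(n_docs / 3) # idf of "the" is 0 because it appears in every document
```

A term appearing in every document (like "the") gets tf-idf 0, which is how the weighting automatically downweights uninformative words.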
complaints %>%
  slice_sample(n = 10000) %>%
  unnest_tokens(word, consumer_complaint_narrative) %>%
  anti_join(get_stopwords(), by = "word") %>%
  mutate(stem = wordStem(word)) %>%
  count(complaint_id, stem) %>%
  bind_tf_idf(stem, complaint_id, n) %>%
  cast_dfm(complaint_id, stem, tf_idf)
## Document-feature matrix of: 10,000 documents, 13,387 features (99.58% sparse) and 0 docvars.
## features
## docs 15000.00 2018 300.00 60 account balanc
## 3113820 0.346224 0.1227675 0.09185327 0.07611636 0.03537895 0.03990726
## 3113828 0 0 0 0 0 0
## 3113842 0 0 0 0 0.06420624 0
## 3113898 0 0 0 0 0 0
## 3113928 0 0 0 0 0.03487063 0
## 3114007 0 0 0 0 0 0
## features
## docs calendar check citibank consecut
## 3113820 0.1207042 0.0778987 0.1844487 0.1166088
## 3113828 0 0 0 0
## 3113842 0 0 0 0
## 3113898 0 0 0 0
## 3113928 0 0 0 0
## 3114007 0 0 0 0
## [ reached max_ndoc ... 9,994 more documents, reached max_nfeat ... 13,377 more features ]
Creating these matrices is very memory intensive!
While you can train your own embeddings, pre-trained word embeddings such as GloVe, which is trained on Wikipedia and news sources, are readily available.
library(textdata)
glove6b <- embedding_glove6b(dimensions = 100)
glove6b
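Each row of glove6b gives a numeric vector for one token, and similarity between words is typically measured with cosine similarity. A minimal sketch with made-up 3-dimensional vectors (the real GloVe vectors loaded above have 100 dimensions, and these values are NOT actual GloVe entries):

```r
# Cosine similarity: dot product divided by the product of the norms.
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Hypothetical toy embeddings, for illustration only.
v_banana  <- c(0.20, 0.80, 0.10)
v_bananas <- c(0.25, 0.75, 0.15)
v_court   <- c(0.90, -0.30, 0.40)

cosine(v_banana, v_bananas) # close to 1: similar words
cosine(v_banana, v_court)   # much smaller: unrelated words
```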